Federation for parallel searching

ABSTRACT

A search engine can be configured to improve search times by implementing a parallel computing architecture. The index is split across a plurality of independent search nodes. A query is communicated to each of the parallel nodes and the query is searched in each independent node. The results from each node are routed to a federator that is configured to aggregate the results into a result set. The federator is configured to determine a subset of intermediate results to retrieve and aggregate from each of the independent nodes. The federator determines a number of results to retrieve from each of the nodes based at least in part on the number of nodes and the number of search results desired in the result set.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/616,743, filed Oct. 6, 2004, titled “IMPROVED WORD DELETION FOR SEARCHES,” U.S. Provisional Application No. 60/616,788, filed Oct. 6, 2004, titled “IMPROVED MEASURE OF PROBABLE RELEVANCE OF CATEGORY TO QUERIES FOR SEARCHES,” U.S. Provisional Application No. 60/616,744, filed Oct. 6, 2004, titled “EFFICIENT FEDERATION FOR PARALLEL SEARCHING,” and U.S. Provisional Application No. 60/616,745, filed Oct. 6, 2004, titled “SEARCH INDEX KEY RANDOMIZATION,” each of which is hereby incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

This disclosure relates to computer-aided searching, and in particular to networked systems that match queries to sets of documents and return a list of matching documents. It is particularly suited for Internet sites that match queries submitted from web browsers to product data stored on one or more servers. Accordingly, to illustrate the principles of the invention, it will be described in terms of an Internet-based shopping service. Those skilled in the art will appreciate that computer-aided searching has many applications and that improvements inspired by one application have broad utility in diverse applications that employ search technology.

Computers, networks, and the Internet have become a standard part of social and business infrastructure. Business now depends on the Internet just as it depends on telephone service. Many companies now sell goods and services online—sometimes from their own web sites; sometimes from intermediate sites similar to retail stores; and sometimes from high-level shopping sites that match shopper requests with product information. The use of computers allows an online shopping service to offer the buyer and the seller advantages that would be impossible for traditional retailers.

Many of these advantages derive from the combination of data storage technology and search engine technology. The shopping service or other service provider maintains a large library of information about products, services, and special offers. The shopping service also maintains a front-end web site so that shoppers can seek and see information from the underlying electronic library. The large volume of data almost obligates the use of a search engine to pinpoint relevant data. Typically, the web site provides a means for the shopper to indicate their area of interest, for example, by entering one or more keywords, attributes, or restrictions from a web page. The shopper's web browser sends the text or other data over the Internet or other network to the shopping service, where a search engine evaluates the request, compares it to an index of documents to identify relevant items, and ranks the results by some measure of relevance. After formatting the results for display and adding text, pictures, scripts, and other data, the shopping service sends the results back over the network to the shopper's web browser. The shopping service can add value to the raw data, for example, by collecting data from many sellers and by supplying product evaluations, merchant evaluations, comparative pricing, and so on.

The practicality of these systems for shoppers, sellers, and the shopping service depends on, among other things, the speed of the system, the cost of the system, and the accuracy of the results. The popularity of the Internet as a meeting place for buyers and sellers means the databases of goods, services, and offers can become enormous—and so can the volume of requests, processing, and results. Unfortunately, each request is an unpredictable event, so the search engine must process incoming requests on the fly, in real time. And it must return results almost instantly, so that each shopper can interact with the site spontaneously. On top of that, it must return an accurate, appropriate reply to an imprecise, unstructured request.

Some search engines require the search originator to use an exact, technical syntax such as Boolean syntax to form each request. Such search engines allow a skilled searcher to formulate extremely specific requests and obtain extremely specific search results. Such search engines can be difficult for unskilled searchers to understand and operate, however—and even skilled searchers can fail to find salient results as a result of minor technical flaws in forming requests. Such search engines are relatively simple to implement because the searcher assumes the burden of deducing and expressing the search logic, and because each request closely resembles the corresponding query executed within the search engine.

Other search engines permit natural-language or free-form requests. And such search engines often permit the use of short-cut indicators or prompts such as check boxes, drop-down lists, or other user-interface devices that indicate keywords, restrictions, or other attributes transmitted as elements of a search request. Such search engines are generally easier to use, especially for the unskilled searcher. Such search engines are generally more difficult to implement, however, because the search engine must perform sophisticated processing to derive logical queries from requests.

The ongoing expansion of online business creates an ongoing need to improve the systems that fulfill requests for information. Some advances reduce the response time of the data source, search engine, or both; for example, by reducing the computational cost of the components that evaluate and fulfill requests. Other advances increase the accuracy of the data returned, for example, by refining the methods used to select, rank, and format it. Gains in speed often enable gains in quality, since time saved in one functional area can be put to good use in another.

BRIEF SUMMARY OF THE INVENTION

A search engine can be configured to improve search times by implementing a parallel computing architecture. The index is split across a plurality of independent search nodes. A query is communicated to each of the parallel nodes and the query is searched in each independent node. The results from each node are routed to a federator that is configured to aggregate the results into a result set. The federator is configured to determine a subset of intermediate results to retrieve and aggregate from each of the independent nodes. The federator determines a number of results to retrieve from each of the nodes based at least in part on the number of nodes and the number of search results desired in the result set.

The disclosure includes a method of processing a user query. The method includes obtaining a query from a user, distributing the query to a plurality of independent nodes, each node configured to search a corresponding segment of an index of the corpus of documents, running the query in each of the nodes against the corresponding segment of the index of the corpus of documents to obtain from each of the nodes an intermediate results list of one or more documents in the document corpus deemed responsive to the query, determining a maximum number of results to retrieve from each of the intermediate results lists, and aggregating a number of results from each of the intermediate results lists into an output result list.

The disclosure also includes a method of processing a user query. The method includes obtaining a query from a user, distributing the query to a plurality of independent nodes, each node configured to search a corresponding segment of an index of the corpus of documents, running the query in each of the nodes against the corresponding segment of the index of the corpus of documents to obtain from each of the nodes an intermediate results list of one or more documents in the document corpus deemed responsive to the query, determining a maximum number of results to retrieve from each intermediate results list based on a total number of results desired, a number of independent nodes, and a predetermined confidence level, retrieving up to the maximum number of results from each of the intermediate results lists, and generating an aggregate output result list having no more than the total number of results desired.

The disclosure also includes an apparatus for processing a user query. The apparatus includes a plurality of computational nodes, each computational node configured to search a corresponding segment of an index of the corpus of documents to obtain from each of the computational nodes an intermediate results list of one or more documents in the document corpus deemed responsive to the query, and a federator configured to receive a query and communicate the query to each of the plurality of computational nodes and configured to aggregate a portion of at least one intermediate result list to generate an output result list.
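
By way of illustration, the following is a minimal Python sketch of the federation described above. The per-node limit heuristic, the node.search() interface, and the function names are assumptions for this sketch only; the disclosure does not prescribe a particular formula here, only that the limit depends on the total number of results desired, the number of nodes, and a predetermined confidence level.

    import heapq
    import math

    def per_node_limit(total_desired, num_nodes, confidence=0.95):
        # Assumed heuristic: retrieve at least an even share from each node,
        # plus a confidence-driven margin, but never more than the full
        # number of results desired.
        even_share = math.ceil(total_desired / num_nodes)
        margin = math.ceil(even_share * confidence * (num_nodes - 1) / num_nodes)
        return min(total_desired, even_share + margin)

    def federate(query, nodes, total_desired, confidence=0.95):
        # Distribute the query, cap each intermediate results list at the
        # per-node limit, and merge the capped lists into one output list.
        limit = per_node_limit(total_desired, len(nodes), confidence)
        intermediate = []
        for node in nodes:
            # node.search(query) is assumed to return (score, document_key)
            # pairs already ranked by relevance within that index segment.
            intermediate.extend(node.search(query)[:limit])
        return heapq.nlargest(total_desired, intermediate, key=lambda hit: hit[0])

With, for example, ten desired results spread across five nodes, a sketch of this kind pulls back only a few results from each node rather than the full ten from every node, which is the saving the federator is intended to provide.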

BRIEF DESCRIPTION OF THE DRAWINGS

The features, objects, and advantages of embodiments of the disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like elements bear like reference numerals.

FIG. 1 is a simplified functional block diagram of an embodiment of a networked computer system.

FIG. 2 is a simplified functional block diagram of an embodiment of a storage device having an application.

FIG. 3 is a simplified functional block diagram of an embodiment of a computer.

FIG. 4 is a simplified functional block diagram of an embodiment of a network based service provider.

FIG. 5 is a simplified functional block diagram of an embodiment of a data source and search engine.

FIG. 6 is a simplified functional block diagram of an embodiment of a data source, search engine, and search index.

FIG. 7 is a simplified functional block diagram of a search engine.

FIGS. 8A-8B are examples of screen images captured from an embodiment of a service provider.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Overview

FIG. 1 is a simplified functional block diagram of an embodiment of a networked computer system 70 in which embodiments of the search system described herein may be implemented. The networked computer system 70 is shown and described in the context of web-based applications configured on client and server apparatus coupled to a common Internet. However, the described system environment is used only as an example of one such system into which embodiments disclosed herein may be implemented. The various searchers and search system embodiments disclosed herein can also be implemented in other systems that do not require networked configurations.

The networked computer system 70 includes one or more clients 10 coupled to a network gateway. In one embodiment, the network gateway can be an Internet Service Provider (ISP) server 72 coupled to a network 74 such as the Internet. The client 10 can communicate with the ISP server 72 and destination devices coupled to the network 74 using one or more predetermined communication protocols. The communication protocol can be determined, for example, by an application running on the client 10. In one embodiment, the application can be a web browser and the communication protocol can be TCP/IP.

A web server 76 can be coupled to the network 74. The web server 76 can be configured to provide a network interface to information accessible via a database server 80. The web server 76 can be a host for a website or destination accessible via the network 74. The web server 76 can host, for example, a portal or website for a merchant. In another embodiment, the web server 76 can be configured to host a portal or website for an aggregator of data or information.

The web server 76 can be coupled to a database server 80 that can be configured to access data that is stored on a related data server 78. The database server 80 can be one or more servers or applications distinct from the web server 76 or at least partially common with the web server 76.

The database server 80 can be configured to facilitate the retrieval of data stored in the data server 78. For example, the database server 80 can organize and categorize the data in the data server 78 according to one or more predetermined taxonomies. The database server 80 can also be configured to include one or more searchers having one or more search engines configured to access the information stored on the data server 78.

The data server 78 can be configured as one or more servers or applications that are distinct or at least partially common with the database server 80. The data server 78 can include data or information that is retrieved via the database server 80. In another embodiment, the data server 78 can be configured to directly search, retrieve, and store data that is available, for example, from a source that is coupled to the network 74.

For example, the data server 78 can be configured to store data that is sourced from a merchant server 77 that is coupled to the network 74. The merchant server 77 can be configured, for example, as an electronic commerce site offering one or more goods or services. The data server 78 can be a data store for an aggregator that collects and provides information relating to the purchase of goods and services.

In one embodiment, the data server 78 or database server 80 can include a crawler, spider, robot, or similar application that is configured to discover and retrieve relevant data from sources coupled to the network 74. The robot can return the retrieved information to the data server 78. Additionally, or alternatively, the merchant server 77 can be configured to provide data to the database server 80 or the data server 78 for storage in the data server 78.

Other network sources can similarly provide information that can be stored in the data server 78. For example, a web client 10 user can submit a document, such as a product review, to a web server 76 of an aggregator site for storage on a data server 78.

The database server 80 can organize the data stored in the data server 78 according to one or more taxonomies. In any set of items, certain items may have relationships that suggest grouping them together into categories; and the resulting categories themselves may have relationships that suggest grouping them into higher-level categories; and so on. Sometimes categories appear to represent natural, real-world relationships; and sometimes they are simply conventional or convenient bins used to organize items. Categories can be the product of human insight, algorithmic analysis, or both working together.

A classification system based on categories is referred to as a “taxonomy.” A common taxonomic structure is a tree-and-branch hierarchy, which proceeds by subdivision from the highest-level category (referred to as the “trunk”), to intermediate categories (referred to as “branches”), to terminal categories (referred to as “leaf,” “leaf category,” or “leaf node”), and ultimately to the underlying items (each referred to as a “document,” “product,” or “offer”). In general, any node or branch above a leaf node can be referred to as a higher node. Thus, every higher node is essentially a container of everything organized below it in the taxonomy. The leaf categories are the containers for the documents contained within them.

The underlying items within each leaf node can be referred to as “documents.” Alternatively, the items within a leaf node may be referred to as a “product” or “offer.” Each document can directly or indirectly include relevant information. For example, the document can be a file having information relevant to the corresponding leaf category. Alternatively, the document can be a pointer or address, such as a Uniform Resource Locator (URL), that identifies a site having relevant information.

Another general taxonomy structure is a directed arc structure, conceptually a graph consisting of taxonomic nodes and directed arcs that connect nodes. Each node represents a category, and each arc represents a parent-child relationship between categories, with the arc pointing from the parent category to the child category. Arcs do not necessarily follow the strict binary separation seen in many hierarchical tree-and-branch structures. Items may appear in more than one leaf category. Taxonomies differ in the rules for defining relationships between categories and items. The exact structural rules are nonessential. What is important is the general system where items fit into predetermined leaf categories.

Documents stored in the data server 78 lend themselves to taxonomic arrangement. Each document can be an item, and an analysis of relationships among documents yields categories and leaf categories. For example, a service provider operating a portal website as a shopping service may create a taxonomy of consumer product categories and organize documents describing products, services, and offers by product category. For example, a parent category called “ELECTRONICS” might contain child categories called “DIGITAL CAMERAS,” “TELEVISIONS,” “DVD PLAYERS,” and so on. If “DIGITAL CAMERAS” represents a leaf category, then it contains zero or more indexed documents describing products, services, or offers involving particular cameras.
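
As a concrete illustration of such a taxonomy, the following Python sketch models categories as nodes that either hold child categories or, at the leaves, hold documents. The class name and the example offer are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class Category:
        # A taxonomy node: higher nodes hold child categories, while leaf
        # categories hold the indexed documents (products, services, offers).
        name: str
        children: list = field(default_factory=list)
        documents: list = field(default_factory=list)

        def is_leaf(self):
            return not self.children

    electronics = Category("ELECTRONICS", children=[
        Category("DIGITAL CAMERAS", documents=["offer: a particular camera"]),
        Category("TELEVISIONS"),
        Category("DVD PLAYERS"),
    ])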

A user at a web client 10 can access the information stored in the data server 78 by accessing the web server 76. For example, a user at the web client 10 can access the information at the data server 78 using one or more applications running on the web client 10. The application can be, for example, a browser application.

The user can use the application running on the web client 10 to input a query to the web server 76. The web server 76 can relay the query to a searcher on the database server 80. The database server 80 can include a search engine that operates on the query to return a list of results that match the query.

“Search engine,” as used herein, refers to a computer hardware and software system that, given a search request or query, matches the request to at least one index prepared from at least one data source, in order to identify terms in the index relevant to the request; and that if necessary performs logical operations on the request, the intermediate results, or both; and that returns a final list of relevant items, if any, to the search originator.

For example, the web client 10 can receive a query from a user and can submit the query to the web server 76. The web server 76 can present the query to a searcher or search engine in the database server 80. The database server 80 can respond to the query by accessing the data server 78 and receiving a response (search results) indicating one or more “hits” found. A query can be in the form of search terms or key words or structured query statements (SQL, Boolean expressions, regular expressions, etc.), or can be formed by selecting terms from choice lists, following links, or a number of other methods currently in use or obvious to one of skill in the art upon review of current literature and/or the present disclosure.

When a query is received by a searcher, it processes the search and returns one or more “hits,” where a “hit” is the atomic unit handled by the search system. The searcher can be configured to operate on a body or corpus of documents stored within the data server 78. For example, where the search system manages a structured database, the hits are records from the structured database. Where the search system manages documents, such as text documents, image and text documents, image documents, HTML documents, PDF documents, or the like, the atomic unit is the document. It should be understood that the present disclosure is not limited to any particular atomic unit. Furthermore, a structured database is not required.

For example, the web client 10 can include a desktop personal computer, workstation, laptop, personal digital assistant (PDA), cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly to the network 74. The web client 10 typically runs a network interface application, which can be, for example, a browsing program such as Microsoft's Internet Explorer™ browser, Netscape Navigator™ browser, Mozilla™ browser, Opera™ browser, Firefox™ browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like. The network interface application can allow a user of the web client 10 to access, process and view information and documents available to it from servers in the system.

The web client 10 also typically includes one or more user interface devices, such as a keyboard, a mouse, touch screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display (e.g., monitor screen, LCD display, etc.), in conjunction with pages, forms and other information provided by servers. Although the system is described in conjunction with the Internet, it should be understood that other networks can be used instead of or in addition to the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.

According to one embodiment, the web client 10 and all of its components are operator configurable using an application including computer code run using a central processing unit such as an Intel Pentium™ processor, AMD Athlon™ processor, or the like or multiple processors. Computer code for operating and configuring the web client 10 to communicate, process and display data and media content as described herein is preferably downloaded and stored on a processor readable storage medium, such as a hard disk, but the entire program code, or portions thereof, may also be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as a compact disk (CD) medium, a digital versatile disk (DVD) medium, a floppy disk, and the like. Additionally, the entire program code, or portions thereof, may be transmitted and downloaded from a software source, e.g., from one of the servers over the Internet, or transmitted over any other network connection (e.g., extranet, VPN, LAN, or other conventional networks) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, FTP, Ethernet, or other media and protocols).

It should be appreciated that computer code for implementing aspects of the present disclosure can be C, C++, HTML, XML, Java, JavaScript, etc. code, or any other suitable scripting language (e.g., VBScript), or any other suitable programming language that can be executed on a client or server or compiled to execute on a client or server.

FIG. 2 is a simplified functional block diagram of a storage device 248 having an application that can be accessed and executed by a processor in a computer system. The application can be a client side application stored in the storage device 248 of a client and running on a client such as the web client 10. Alternatively, the application can be stored in the data storage of a server and run on a server such as the database server 80.

The data store 248 can be one or more memory devices that can be accessed by a processor. The data store 248 can include application code 250 that can be configured to store one or more processor readable instructions. The application code 250 can include application logic 252, library functions 254, and file I/O functions 256 associated with the application.

The data store 248 can also include application variables 262 that can include one or more storage locations configured to receive input variables 64. The application variables 262 can include variables that are generated by the application or otherwise local to the application. The application variables 262 can be generated, for example, from data retrieved from an external source, such as a user or an external device or application. The processor can execute the application code 250 to generate the application variables 262 provided to the data store 248.

One or more memory locations can be configured to store device data 266. Device data 266 can include data that is sourced by an external source, such as a user or an external device. The device data 266 can include, for example, the search query input by a user or the corpus of data that is searched by a search application in response to a query.

The data store 248 can also include a log file 280 having one or more storage locations 284 configured to store results of the application or inputs provided to the application. For example, the log file 280 can be configured to store a history of queries or can be configured to store a history of query results.

FIG. 3 is a simplified functional block diagram of an embodiment of a computer 300. The web client 10 and each of the servers shown in the system 70 of FIG. 1 can be implemented as a computer 300. The computer 300 can be configured to implement at least a portion of the data storage shown in FIG. 2, to support an application.

The computer 300 can include a Central Processing Unit (CPU) 330 coupled to one or more associated devices over a bus 350. The CPU 330 can be a general purpose processor, a Reduced Instruction Set Computer (RISC) processor, or a combination of processors that can include, for example, a general purpose processor and a digital signal processor.

Although the bus 350 is shown as a single bus, the bus 350 can include multiple buses or communication links. For example, the computer 300 can implement a first bus that is configured to couple the CPU 330 to local memory, such as RAM 332. The computer 300 can also include one or more additional buses that are used to couple the CPU 330 to peripheral devices.

The CPU 330 can be configured to access program storage 334 to retrieve and execute an application stored therein. Program storage 334 can be any type of memory, and can be implemented as internal memory or removable memory. For example, program storage can include a hard disk, ROM, or some other type of memory.

The computer 300 can also include RAM 332 and data storage 336 typically used for temporary storage of data. The combination of RAM 332, program storage 334, and data storage 336 can be configured as the data store shown in FIG. 2. The computer 300 can include a clock 336 or time keeping device configured to track time for applications that are time or date related.

The computer 300 can also include one or more peripheral devices configured as input/output (I/O) devices or as devices supporting or otherwise related to I/O devices. The peripheral devices can include a network driver 360 coupled to the bus 350 and configured to communicate with a network interface device 362. The network interface device 362 can be configured to interface the computer 300 with a network, such as the network shown in the system of FIG. 1.

The peripheral devices can also include a keyboard driver 340 coupled to the bus 350 that is configured to interface a keyboard to the computer 300. Similarly, the computer 300 can include a mouse driver 342, display driver 344, and printer driver 346.

The computer 300 can also include a separate graphics processor 370 configured to operate with graphics intensive applications in order to reduce the processing load on the CPU 330. In some embodiments, the graphics processor 370 can be implemented with the display driver 344, for example, in a graphics card.

FIG. 4 is another simplified functional block diagram of an embodiment of a networked computer system 410. The system 410 comprises at least one personal computer 412 (operated by a search originator, such as a shopper, a robot “spider,” or a partner site) connected to a network 414 such as the Internet. The personal computer 412 can be, for example, the computer 300 of FIG. 3.

Service provider 416, such as, but not limited to, a shopping service, maintains at least one data source 420 readable by a search engine 430 also connected (possibly indirectly) to network 414. In the embodiment of FIG. 1, the service provider 416 can include the web server 76, database server 80, and data server 78. Personal computer 412 runs application software 413 such as a web browser capable of sending a request 418 over network 414 to a designated network address such as that of service provider 416. FIG. 8A shows a “screen shot” of a representative web page including a text field for entering keywords as part of request 418.

Service provider 416 routes incoming request 418 to search engine 430, which interprets request 418 and compares it, or derivatives of it, to an internal index 440 prepared from data source 420. Search engine 430 builds a list of matching documents 22 and can rank the match list in order by some measure of relevance. Search engine 430 may truncate the match list to some maximum length and may combine it with other data taken from data source 420 or other sources. Search engine 430 (perhaps indirectly by passing the match list to other components) then uses the match list to build a response 419, for example, by adding markup code in HTML, XML, or similar languages and by adding text, pictures, scripts, summary data, or other information taken from data source 420 or from other sources. Finally, service provider 416 sends response 419 over network 414 to personal computer 412 where application software 413 interprets response 419, for example by displaying a rendition of it to the shopper. FIG. 8B shows a “screen shot” of a web page displaying a representative response 419.

FIG. 5 is a simplified functional block diagram of an embodiment of a search engine 430 that can be a portion of a search system. The search engine 430 can include one or more search logic units (SLUs) 32, search auto categorizers (SACs) 39, federators 33, nodes 34, indexers 38, indexes 440, data or document servers 36, and formatters 37.

The SLU 32 is a component that receives requests 418 from the front-end web site and interprets the requests 418 by lexing, parsing, and other processing to generate processed queries 17. Federator 33 distributes queries 17 to one or more nodes 34, each of which is a component that can be configured to perform search-matching calculations on index 440 (which is, for embodiments with multiple nodes 34, split into multiple index segments 54). In another embodiment, the federator 33 can distribute the original request 418 to each of the nodes 34 for processing.

The indexer 38 is a component that compiles index 440 from data source 420 (and, for embodiments with multiple nodes 34, splits index 440 into segments 54). Document server 36 is a component that stores documents 22 and that, given a global key 24, returns the corresponding document 22. Indexer 38 and other system components can rely on document server 36 to obtain the content of documents 22.

Each node 34 returns its match list to federator 33, which (for embodiments with multiple nodes 34) builds a composite result 48 used to build response 419. If necessary, each node 34 or federator 33 uses a mapping table prepared along with index 440 to translate each random key 26 to the corresponding global key 24. Formatter 37 builds a formatted response 419 from a match list or composite result 48, for example, by combining it with text, pictures, scripts, and other data obtained from, for example, document server 36. Formatter 37 then sends response 419 to the service provider's front-end web site, which passes response 419 to the requesting search originator via network 414.

Those skilled in the art will appreciate that all components of search engine 430 shown in FIG. 5 are high-level functional components comprising hardware and software; that search engine 430 may have more or fewer components; and that many variations are possible in the structure and arrangement of these components.

Data Source, Documents, Global Keys

FIG. 6 is a functional block diagram of an embodiment of a data source 420, which can be implemented as a hardware, software, or combined hardware and software system adapted to store information for ready retrieval. For example, data source 420 can be implemented in the data server of the system of FIG. 1.

The data source 420 can include one or more documents 22. A document 22 is a means of storing static or dynamic data of any type, structure, or content, such as, but not limited to, text, pictures, scripts, computed values, and uniform resource locators (URLs). For example, a document 22 may be a structured or freeform text file containing lexical information, numerical information, or both; or a document 22 may be a local or remote web page accessed by manual or automatic methods; or a document 22 may be a record in a database comprising records comprising fields. For a shopping service, for example, a document 22 may be a text file that encodes information about products, services, or offers in a specified format and sequence. An “offer” in this context refers to a specific merchant offer, referring to a specific product and comprising, for example, price, availability, and other selected attributes. An offer may comprise multiple documents. However, the contents and format of the document 22 are not limited to any particular format or content.

Each document 22 in data source 420 may have a unique identifier, referred to as a global key 24, that is typically assigned when document 22 is added to data source 420 and typically durable for the life of document 22. Global key 24 provides an efficient, unambiguous, stable “name” or “handle” for document 22 for use by system components that read, write, or otherwise reference document 22. The sequence of global keys 24 typically reflects the sequence of document addition, and the set of global keys 24 may become sparse after ongoing document deletion. To avoid undesirable properties associated with an index 440 that refers to documents 22 by means of non-random global keys 24, some embodiments of search engine 430 can create a second, randomized set of keys called random keys 26. Search engine 430 performs its internal computations on global keys 24 or random keys 26 and if necessary maps random keys 26 to global keys 24 for the benefit of components that operate on, or otherwise utilize, global keys 24.
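
One way to picture the random keys 26 and the mapping back to global keys 24 is the following Python sketch; the dense renumbering used here is only an assumption for illustration, not a method mandated by the disclosure.

    import random

    def assign_random_keys(global_keys, seed=0):
        # Shuffle the global keys and hand out sequential random keys, keeping
        # a mapping table so components can translate back to global keys.
        shuffled = list(global_keys)
        random.Random(seed).shuffle(shuffled)
        random_to_global = dict(enumerate(shuffled))
        global_to_random = {g: r for r, g in random_to_global.items()}
        return global_to_random, random_to_global

    global_to_random, random_to_global = assign_random_keys([101, 102, 103, 107])
    # The index stores random keys; before formatting a response, the federator
    # or node looks each one up in random_to_global.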

Search Engine and Index

The search engine 430 can be a software and hardware system adapted to retrieve information from a data source 420. To use search engine 430, a search originator such as a shopper creates a request 418, for example, by entering one or more text keywords, attribute restrictions, or both on a web page. Search engine 430 may require a formal syntax such as Boolean syntax; or it may permit the use of natural language; and it may permit the use of short-cut indicators or prompts such as check boxes, drop-down lists, or other user-interface devices. After creating request 418, the search originator sends request 418 to search engine 430, for example, via network 414.

Search engine 430 via SLU 32 processes request 418 to extract significant search keywords, attributes, and restrictions and to build a query 17. Search engine 430 via federator 33 then sends query 17 to one or more nodes 34 that compare query 17 to one or more indexes 440 prepared from data source 420. For each term in index 440 that matches query 17, each node 34 returns a document list 48 comprising global keys 24 or random keys 26 for relevant documents 22. The matching method may consider complex queries 17 comprising multiple keywords, attributes, and restrictions, in any combination, so that a “final” document list 48 may represent the result of one or more logical or mathematical operations applied to one or more document lists 48. For embodiments with multiple nodes 34, federator 33 consolidates multiple document lists 48. After ranking the results by some measure of relevance, search engine 430 maps each random key 26 to the corresponding global key 24. Search engine 430 via formatter 37 calls document server 36 to get one or more documents 22 to obtain content for response 419. Either federator 33 or each node 34 maps each random key 26 to the corresponding global key 24 for the benefit of formatter 37. Formatter 37 then builds response 419 by adding markup code in HTML, XML, or similar languages and by adding text, pictures, scripts, summary data, or other information taken from data source 420 or from other sources.

Index 440 is a list of one or more index records 42, each comprising a term 44 relevant to data source 420, a unique index key 46, and data 47. The term “term” as used herein includes any item that is the potential subject of a search such as, but not limited to, single words, phrases, categories, products, services, and offers. “Term” also includes attributes and restrictions, where an “attribute” is any feature or combination of features representable by any sort of token, and a “restriction” is any word, phrase, category, attribute, or similar item to be excluded from searching or search results. For each term 44, data 47 typically comprises a document list 48 comprising a set of one or more random keys 26 selected to identify documents 22 relevant to term 44. Document list 48 is sorted in ascending or descending random-key order. Data 47 may include information other than mere random keys 26, such as, but not limited to, a count of the number of occurrences of term 44 in document 22. Each term 44 has a unique index key 46 that provides an efficient, unambiguous identifier for term 44 for use by search engine 430 or other system components that read, write, or otherwise reference a term 44 in index 440. Unique index keys 46 also permit splitting a document list 48 for a given term 44 among multiple nodes 34.

Search engine 430 builds index 440 via indexer 38 by parsing at least one document 22 in data source 420 to extract terms 44 and a document list 48 for each term 44. Typically, indexer 38 starts with a first document 22, parses it, advances to a second document 22, parses it, and so on until indexer 38 has examined every document 22 in data source 420. The list of terms 44 may include predetermined or software-selected words, phrases, or other items, alone or in combination. For example, predetermined terms 44 might include product names known beforehand, and software-selected terms 44 might include all salient words that occur in data source 420. Index 440 therefore is a product of a software compilation run on data source 420, and index 440 must exist before search engine 430 can perform searches. Updating index 440 requires recompiling it. In some embodiments, index 440 may be split into multiple index segments 54, permitting parallel execution of index-matching computations by multiple nodes 34a through 34n, as shown in FIG. 7.

Conceptually, index 440 for search engine 430 is similar to the familiar subject index common for published books. The list of terms 44 is like the alphabetical list of words in a subject index, and document list 48 is like the set of page number references for each word. The general operation is also similar. With a subject index, a reader searches the alphabetical list for a word, looks up its indexed page number, and then opens the book to that page. With a computerized index, search engine 430 searches the list of terms 44 for terms 44 that match a request 418. For each match, search engine 430 returns a document list 48 as a basis for building a response 419 ultimately sent to application software 413. Unlike a subject index, a computerized index can perform logical operations, mathematical operations, or both on request 418, terms 44, document lists 48, and other parts of the system. For example, if request 418 comprises two terms 44 and seeks documents 22 containing both terms 44, then search engine 430 can look up the document lists 48 for both terms and compare the two lists to find their logical intersection.
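
The subject-index analogy can be made concrete with a short Python sketch of an inverted index and the logical intersection of two document lists; the helper names and the sample documents are hypothetical.

    def build_index(documents):
        # Map each term to a sorted list of document keys in which it occurs,
        # analogous to the alphabetical word list and page references of a
        # printed subject index.
        index = {}
        for key, text in documents.items():
            for term in set(text.lower().split()):
                index.setdefault(term, []).append(key)
        return {term: sorted(keys) for term, keys in index.items()}

    def intersect(list_a, list_b):
        # Merge-style intersection of two sorted document lists.
        result, i, j = [], 0, 0
        while i < len(list_a) and j < len(list_b):
            if list_a[i] == list_b[j]:
                result.append(list_a[i])
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                i += 1
            else:
                j += 1
        return result

    index = build_index({1: "canon powershot digital camera",
                         2: "digital camera bag",
                         3: "canon zoom lens"})
    both_terms = intersect(index["canon"], index["digital"])  # -> [1]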

Word Deletion

Search engine 430 allows the search originator to submit freeform, natural-language search requests 18. As described above, SLU 32 processes “raw” incoming requests 18 to generate queries 17 sent to one or more nodes 34, and nodes 34 apply queries 17 to index 440 (or index segments 54) to perform low-level index-matching operations. The processing applied by SLU 32 to derive queries 17 comprises various lexical operations, such as, but not limited to: extracting the “stem” of a word to also search for its syntactical variations; checking spelling to correct human error; looking up a word in a thesaurus to also search for its synonyms; and eliminating common words (“stop words”) to avoid excessive computation during searching. Typical English-language stop words include “a,” “an,” “the,” and similar syntactic elements that occur so frequently in English that they are rarely salient terms for searching; and other languages have equivalent stop words. Those skilled in the art often prepare lists of language-specific stop words and ignore stop words for most searches.

“Word deletion,” as used herein, includes a lexical operation applied to requests 18 when generating queries 17. Search engine 430 can be configured to accept freeform, natural-language requests 18. Such requests 18 are often imprecise, verbose, or both. Many spontaneous requests 18 initially fail to return a result at all because index 440 does not, by strict logic, contain data that fulfills the request 418 or initial base queries 17 generated from it. In an example, the initial request 418 may be overdetermined or may include mis-associated components. An example of a mis-associated request includes a request for a particular manufacturer's product identified by another manufacturer's part number or trade name.

Returning a “nothing found” response 419 is likely to frustrate and mislead the search originator. Data source 420 often does contain documents 22 relevant to the search originator's intended field of search (as distinguished from their verbatim request). Word deletion is a method for building a new query 17b based on the original request 418. As input, word deletion accepts a request 418 as formulated by the search originator. As output, it returns a subset of the words in the original request 418. The SLU 32, via federator 33, can then submit the new query 17b to nodes 34 in a new attempt to obtain a relevant result.

Word deletion attempts to remove words from the original search query, based on its context as computed by a Search Auto Categorizer (SAC) 39. The SAC 39 is a search-engine component that, given a request 418 (or a derived query), returns a list of zero or more relevant leaf categories, each with a probability value to measure the relevance of the category to the request. SAC 39 provides such context through the mechanism of the probabilistic distribution of the leaf categories in which the documents 22 relevant to the original request 418 may reside. “Leaf categories,” as used herein, are the non-intersecting atomic categories in a taxonomy, for example, that of a portal website. The taxonomic structure may be a “tree and branch” hierarchy; or, more generally, it may be a graph consisting of taxonomic nodes and directed arcs that connect nodes.

Each leaf category is a “terminal” category in the taxonomy and contains zero or more documents from a data source. Leaf categories can be the only categories that contain documents, which do not appear in parent categories. As terminal categories, leaf categories do not contain underlying successor (or “child”) categories. In a directed-arc taxonomy, for example, leaf categories are those without arcs pointing at deeper categories. For each request (that is, a query), SAC 39 computes probabilities only for leaf categories. Leaf categories represent pre-existing “bins.”

For example, for a portal website operated by a shopping service, each leaf category contains zero or more documents describing products, services, or offers. For example, a “parent” category called “ELECTRONICS” is not a leaf category if it has subcategories, such as “DIGITAL CAMERAS,” “TELEVISIONS,” “CLOTHING,” and so on. “DIGITAL CAMERAS,” in contrast, is a leaf category if it contains zero or more indexed documents indicating offers for digital cameras and no subcategories. In this situation, the SAC, given a query for “Canon PowerShot” (a brand name of a digital camera), would return “DIGITAL CAMERAS” as the highest-probability leaf category. It might also return other leaf categories at lower probabilities. For example, a tee shirt with a Canon logo might trigger a lower-probability return for the leaf category “CLOTHING.”
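
The behavior described above can be pictured with a small Python stand-in for the SAC; the function name and the probability values are illustrative assumptions only.

    def run_sac(query):
        # Return leaf categories with the probability that documents relevant
        # to the query reside in each of them.
        if query == "Canon PowerShot":
            return {"DIGITAL CAMERAS": 0.92,
                    "ELECTRONICS ACCESSORIES": 0.05,
                    "CLOTHING": 0.03}
        return {}

    leaf_probabilities = run_sac("Canon PowerShot")
    best_leaf = max(leaf_probabilities, key=leaf_probabilities.get)  # "DIGITAL CAMERAS"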

Thus, search engine 430 can receive a request for which no relevant results will be returned. In such a situation, the search engine 430 can perform word deletion on the query in order to generate a query for which relevant results will be returned. The search engine 430 can be configured to perform word deletion according to one or more word deletion algorithms. The word deletion algorithms can loosely be described as word-pair analysis, exhaustive deletion, and build-up deletion.

Word-Pair Analysis

The search engine 430 can be configured to perform word-pair analysis on an original query that will return no results without word deletion. The search engine 430 inputs the original query to the SAC 39. The SAC 39 operates on the original query and returns a list of leaf categories and, for each identified leaf category, the probability that the query results can be found in the leaf category.

The search engine 430 can then submit the original query to the SLU 32, where the SLU 32 generates a plurality of word pairs from the original query. In one embodiment, the SLU 32 can be configured to generate every possible word pair combination from the original query. In another embodiment, the SLU 32 can be configured to generate a subset of all possible word pair combinations. In yet other embodiments, the SLU 32 can be configured to generate word pairs from the original query for further processing until a threshold number of word pairs have been analyzed, or until all possible word pairs have been analyzed.

The SLU 32 submits each word pair as a query. The search engine 430 can then generate a count of documents corresponding to a search of the word pair. The counts will correspond to each leaf category identified by the SAC 39 for the original query. The search engine 430 can compute the weighted average of the counts by weighting each of the counts by the probability that the original query results lie within that leaf category.
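
Stated as a formula (a restatement of the step just described, with symbols introduced here only for illustration), where P(ℓ|q) is the SAC probability of leaf category ℓ given the original query q and count_ℓ(p) is the document count returned for word pair p in leaf category ℓ, the weighted average is:

$\overline{c}(p) = \sum\limits_{\ell} P\left( \ell \mid q \right)\, \mathrm{count}_{\ell}(p)$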

The SLU 32 can then select one of the word pairs from the list of analyzed word pairs. The SLU 32 can select the word pair based on the weighted averages. The SLU 32 can select the word pair based on predetermined criteria. For example, the SLU 32 can select the word pair corresponding to the lowest weighted average if a larger, more general, list of search results is desired. Alternatively, the SLU 32 can select the word pair corresponding to the highest weighted average if a smaller, more narrowly focused list of search results is desired.

The SLU 32 then generates two queries based on the original query and the selected word pair. The SLU 32 generates a first query by deleting from the original query a first word from the selected word pair. The SLU 32 similarly generates the second query by deleting from the original query a second word from the selected word pair.

The search engine 430 operates on each of the new queries and obtains, for each of the two queries, a count of the documents in each of the leaf categories. For each of the queries, the search engine 430 computes a weighted average using the leaf category probabilities determined using the original query.

The SLU 32 can then select one of the two words in the selected word pair for deletion based on the results of the two weighted averages. The SLU 32 can delete the word corresponding to the highest weighted average if a larger, more general, search result is desired. Alternatively, the SLU 32 can delete the word corresponding to the lowest weighted average if a more focused, narrower, search result is desired.

The SLU 32 generates a truncated query by deleting the selected word from the original query. The search engine operates on the truncated query to return search results. The search engine 430 can determine the relevance of the search results using the leaf category probabilities determined from the original query or from a truncated query.

Therefore, the word-pair analysis algorithm can be summarized as follows:

1. Run SAC 39 on the original request, to obtain the distribution of probability among leaf categories for the entire request.

2. For each word-pair in the original request, (a) run the word-pair through the search engine to obtain the counts of documents in every leaf category; (b) for each count of offers or documents, compute the weighted average of the above counts, using the leaf-category weights obtained in Step 1. “Word-pair” herein may refer to consecutive pairs or permuted pairs, and the construction of word-pairs may retain or delete stop words from the original query.

3. Choose the pair according to the two desired optimization options as follows: If a more generalistic, larger set of final search results is desired, choose the pair with the lowest weighted average. If it is desired to satisfy a user with a more specific and narrowly focused query, then choose the pair with the highest weighted average.

4. For each single word from the word-pair chosen in Step 3, run the original query minus that word through the search engine, to obtain the counts of offers or documents in every leaf category, and compute the weighted average of the latter counts, using the leaf-category weights obtained in Step 1.

5. Delete from the original query the word in Step 4 which leads to the lowest or highest weighted average, in accordance with the two desired optimization options. If a more generalistic, larger set of final search results is desired, delete the word leading to the highest weighted average. If it is desired to satisfy a user with a more specific and narrowly focused query, then delete the word leading to the lowest weighted average.

6. Rerun the search through the search engine and compute the final relevance using the distribution obtained in Step 1 and the truncated query obtained in Step 5.
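
The six steps above can be sketched in Python as follows. The helpers leaf_probs (the Step 1 SAC distribution) and search_count_by_leaf (per-leaf document counts for a set of words) are hypothetical stand-ins, and the optimization option is exposed as a flag.

    from itertools import combinations

    def word_pair_deletion(query_words, leaf_probs, search_count_by_leaf,
                           prefer_general=True):
        # Weighted-average count of a candidate set of words across leaf
        # categories, using the Step 1 probabilities as weights.
        def weighted_count(words):
            counts = search_count_by_leaf(words)
            return sum(leaf_probs.get(leaf, 0.0) * count
                       for leaf, count in counts.items())

        # Steps 2-3: score every word pair and choose one per the option.
        pairs = list(combinations(query_words, 2))
        choose_pair = min if prefer_general else max
        chosen_pair = choose_pair(pairs, key=weighted_count)

        # Steps 4-5: try deleting each word of the chosen pair from the query.
        candidates = []
        for word in chosen_pair:
            truncated = [w for w in query_words if w != word]
            candidates.append((weighted_count(truncated), truncated))
        best = max(candidates) if prefer_general else min(candidates)

        # Step 6: the caller reruns the search with the truncated query.
        return best[1]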

Exhaustive Deletion

The search engine 430 can be configured to perform exhaustive deletion on an original query that will return no results without word deletion. Exhaustive deletion operates on truncated queries rather than on word pairs as in word-pair analysis.

The SLU 32 generates a plurality of truncated queries based on the original query. To generate the truncated query, one word is deleted from the original query. Therefore, the number of truncated queries is based on the number of words in the original query. In another embodiment, one or more words can be deleted from the original query to create the truncated query. In such an embodiment, the number of possible truncated queries can be determined based on the number of words in the original query and the number of words that are deleted from the original query for each instance of the truncated query. Of course, the SLU 32 need not generate all possible truncated queries, but may generate a subset of all possible truncated queries based upon some predetermined criteria. In other embodiments, particularly in embodiments where the query is not completely preprocessed to include only word stems, the SLU 32 may generate a truncated query by deleting one or more words or one or more portions of words.

The search engine 430 operates on each of the truncated queries and returns a corresponding count of search results for each of the truncated queries. If all of the truncated queries return no search results, the word deletion analysis is complete and no search results are returned.

However, if any of the truncated queries generates non-zero search results, the SLU 32 can select a desired truncated query. Viewed from another frame of reference, the SLU 32 selects a word for deletion from the original query. The SLU 32 can identify the word for deletion, or otherwise select a desired truncated query, based on predetermined criteria.

In one embodiment, the SLU 32 can select the word for deletion or select the desired truncated query according to the count values. If a narrow, more focused, search result is desired, the SLU 32 can select the truncated query having the lowest non-zero count. Alternatively, if a larger, more general, search result is desired, the truncated query corresponding to the highest count can be selected.

The search engine 430 can then return the search results corresponding to the selected truncated query.

Therefore, the exhaustive deletion algorithm can be summarized as follows:

1. For each word w∈q, execute search_count(q−w). Here q represents the original query and w represents each word within the original query. If all counts are zero, return FAIL.

2. Operating only on the set of the words w which induced non-zero counts in 1), i.e., for which search_count(q−w)≠0, set

$q_{out} = q - \arg\min\limits_{w \in q}\left( \mathrm{search\_count}\left( q - w \right) \right)$

Here, the min can be replaced with max in accordance with the desired optimization policy as described above.

3. Return to the user the results produced by search(q_out).
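
A Python sketch of the exhaustive deletion algorithm follows; search_count is a hypothetical helper returning the number of matching documents for a list of words, and the min/max choice is exposed as a flag.

    def exhaustive_deletion(query_words, search_count, prefer_general=False):
        # Step 1: try deleting each word in turn and keep only the truncated
        # queries that still return a non-zero count.
        candidates = []
        for word in query_words:
            truncated = [w for w in query_words if w != word]
            count = search_count(truncated)
            if count > 0:
                candidates.append((count, truncated))
        if not candidates:
            return None  # FAIL: every single-word deletion still returns nothing.
        # Step 2: min keeps the narrowest non-zero result; max keeps the broadest.
        chosen = max(candidates) if prefer_general else min(candidates)
        # Step 3: the caller runs search() on the returned truncated query.
        return chosen[1]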

Build-Up Deletion

The SLU 32 can perform build-up deletion by generating a truncated query based on the individual words of the original query. The SLU 32 begins with an empty or otherwise null truncated query and selects a word from the original query based on predetermined criteria.

In one embodiment, the SLU 32 can be configured to select a word from the original query for addition to the truncated query by selecting the word that results in the minimal search count when the word is added to the truncated query. The SLU 32 also does not attempt to select words that are already selected for the truncated query and does not select words that cause the truncated query to return no search results.

The SLU 32 continues to build the truncated query from the words in the original query until the remaining non-selected words in the original query would cause the search to fail if added to the truncated query. The truncated query is then determined to be the subset of words built up from the original query.

In another embodiment, the SLU 32 can be configured to select a word from the original query for addition to the truncated query by selecting the word that results in the maximum search count when the word is added to the truncated query. The SLU 32 continues to select words from the original query until the addition of another selected word causes the search of the truncated query to result in no results.

Therefore, the build-up deletion algorithm can be summarized as follows:

1. Start with the initially empty q_out, choosing, at every step for the inclusion into q_out, the word that minimizes search_count(q_out∪w).

2. Operating only on the set of the words w∈q which induce non-zero counts, i.e., for which search_count(q_out∪w)≠0, choose the w which minimizes search_count(q_out∪w), and add it to q_out, i.e.,

$q_{out} = q_{out} \cup \arg\min\limits_{w \in q}\left( \mathrm{search\_count}\left( q_{out} \cup w \right) \right)$

Here, the min can be replaced with max in accordance with the desired optimization policy as described above.

3. Set q=q−w, where w is the word obtained in 1), i.e., delete w from q.

4. Repeat steps 2 and 3 until q=Ø.

5. Return to the user the results produced by search(q_out).
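
A corresponding Python sketch of build-up deletion is shown below, under the same assumptions about the hypothetical search_count helper.

    def build_up_deletion(query_words, search_count, prefer_general=False):
        remaining = list(query_words)
        q_out = []
        choose = max if prefer_general else min
        while remaining:
            # Steps 1-2: score each remaining word by the count of q_out plus
            # that word, keeping only words that still yield a non-zero count.
            scored = [(search_count(q_out + [w]), w) for w in remaining]
            scored = [(count, w) for count, w in scored if count > 0]
            if not scored:
                break  # Any further addition would make the search fail.
            count, word = choose(scored)
            # Steps 3-4: add the chosen word to q_out and remove it from q.
            q_out.append(word)
            remaining.remove(word)
        # Step 5: the caller runs search() on q_out.
        return q_out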

In practice, the search engine 430 can perform word deletion according to multiple deletion algorithms. The search engine 430 can, for example, run the exhaustive deletion and build-up deletion algorithms in tandem. The search engine 430 can initially run exhaustive deletion on the original query, and its result, if available, is returned to the user. If it returns FAIL, then the search engine 430 can perform build-up deletion on the same original query, and its result, if available, is returned to the user. The exhaustive deletion is tried first so as to impart a minimum alteration to the original query without invoking the build-up deletion.

Notice that for brevity, the functions search( ) and search_count( ) are shown in the above algorithms with only one argument, namely, that of a query. In actual implementation, both functions can receive additional arguments restricting the search and the deletion algorithms to operate only on a subset of attributes and atomic categories, when so desired by the user while navigating the site, or dictated by business needs.

Imprecise Requests and Categories

Such requests 18 are often imprecise, verbose, or both. Many spontaneous requests 18 initially fail to return a result at all because index 440 does not, by strict logic, contain data that fulfills the request 418 or initial queries 17 generated from it. As noted earlier, data source 420 often does contain documents 22 relevant to the search originator's intended field of search. Even if request 418 does yield one or more exact-match documents 22, it remains useful to identify contextually similar documents 22 based on request 418. The challenge is to accurately identify documents 22 significantly related to request 418, for example, to augment or redirect response 419 based on a context derived from request 418.

The taxonomy imposed on documents 22 groups together related documents 22. A request 418 that matches a particular document 22 inside a given category therefore probably has a strong relationship to all documents 22 in that category. For example, a request 418 that matches digital camera “A” probably has a strong affinity to digital cameras “B” and “C” even if “B” and “C” do not occur as direct search matches. And for a request 418 that does not return any direct matches at all, identifying categories with strong correlations to the request 418 creates a mechanism for constructing an appropriate response 419. The search engine 430 can be configured to measure the relevance of leaf categories to requests (that is, queries) for searches. The search engine 430 can identify leaf categories that have the best likelihood of fulfilling request 418. These leaf-categories may be a direct component of response 419 or may be the subject of further processing.

Measure of Leaf-Category Probability for a Request

Search Engine 430 includes a Search Auto Categorizer (SAC) 39 and Search Logic Unit (SLU) 32. SAC 39 provides the SLU 32 with a query-context through the mechanism of the probabilistic distribution of the leaf-categories in which the products relevant to the search-query may reside. The relevance of any leaf-category to the entered query is defined as the probability of such leaf-category in the presence of the query.

The search engine 430 can compute relevance for each of the leaf-categories available on the website by obtaining an initial relevance-value estimate and then refining that estimate to convergence. The SAC 39 can generate an initial probability that the results lie within a particular leaf category. The SLU 32 can generate a relevance based on the probabilities generated by the SAC 39.

The probability values generated by the SAC 39 can be modified by operation of the SLU 32 to improve the relevance of the search results. The SLU 32 can weight the original probabilities determined by the SAC 39 to generate weighted probabilities. The SLU 32 can then operate on the weighted probabilities to generate a refined relevance. The SLU 32 can iteratively weight the probabilities and re-determine the relevance until the relevance converges to within a predetermined threshold.

The search engine 430 initially receives a request or query. The SAC 39 generates a list of leaf categories and the corresponding probabilities that the search results reside in the particular leaf categories. The SLU 32 uses the probabilities generated by the SAC 39 to determine a relevance of the search results. The SLU 32 can determine the relevance by initially determining the probability of a given product or document matching the query in a given leaf category.

In one embodiment, the SLU 32 can determine the probability of a document matching the query in a given leaf category by examining the extent to which the query matches the text or attributes of a document. For example, the SLU 32 can determine how many of the words in the query are in the actual document, and whether the words appear in the title, description, body, or some other portion of the document. The SLU 32 can take into account whether the matching query words appear in the same document field or if they are scattered across multiple fields. The SLU 32 can be configured to take into account other factors, such as in-stock availability, price, image availability, popularity based on prior user selections, merchant, and other factors. The SLU 32 can determine the probability of a document matching the query in a given leaf category based on a combination of all or a subset of these factors. For example, each of the factors may be associated with a probability or other metric and the SLU 32 can multiply together all of the relevant probabilities or metrics to arrive at the probability of a document matching the query in a given leaf category.
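
As an illustration of combining such per-factor metrics by multiplication, the following sketch assumes each factor has already been mapped to a probability-like score; the factor names and values are hypothetical.

```python
# Illustrative sketch only: multiply per-factor scores to approximate
# P(pid | q, cid), the probability that a document matches the query in a
# given leaf category. The factor names and values are hypothetical.

def document_match_probability(factor_scores):
    p = 1.0
    for score in factor_scores:   # e.g., title match, field scatter, in-stock
        p *= score
    return p

p_pid_given_q_cid = document_match_probability([0.9, 0.8, 1.0, 0.6])  # 0.432
```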

The SLU 32 then computes the relevance of the product to the original query. The SLU 32 can determine this relevance by weighting each of the earlier probabilities determined by the SLU 32 by the leaf category probabilities determined by the SAC 39, and summing over all leaf categories.

The SLU 32 can generate a revised probability that the search results reside in the particular leaf categories. That is, the SLU 32 can revise the probabilities initially determined by the SAC 39. For each leaf category, the SLU 32 can generate a revised probability that the product resides in that leaf category by weighting each relevance result from that leaf category previously determined by the SLU 32 by a predetermined weighting factor less than one. The revised leaf category probability is the sum of all weighted relevance probabilities within that leaf category.

The SLU 32 then determines an updated relevance based on the revised leaf category probabilities. The SLU 32 can continue to revise the leaf category probabilities based on the updated results and re-determine the updated relevance until the relevance values converge to within a predetermined threshold.

The relevance ranker can be summarized as follows (an illustrative sketch appears after the listed steps):

1. In response to a search request, q, SAC computes an initial estimate of the probability P(cid|q) that products (documents) relevant to that query reside in a given leaf-category, denoted cid.

2. SLU computes the probability, P(pid|q, cid), representing the relevance of a product, denoted pid, found to match the query, q, in the category, cid.

3. The components computed in (1) and (2) are now combined to obtain the probability:

$P\left( pid \mid q \right) = \sum\limits_{cid} P\left( pid \mid q, cid \right) \cdot P\left( cid \mid q \right) \qquad (1)$ representing the relevance of the product, pid, to the query, q. Here, the summation extends over all of the categories, cid, where the product was found by the SLU to match the query, q.

4. SLU is now ready to improve the initial estimate obtained in (1) of the probability P(cid|q) as follows: within each category, cid, where products were found to match the query, let

$P\left( cid \mid q \right) = \sum\limits_{pid \in cid} w_{pid} \cdot P\left( pid \mid q \right) \qquad (2)$ where the weights w_(pid) can be chosen according to business requirements modeled by a utility function combining the utility to the search-users as well as to the business entity providing the search, such that

$\sum\limits_{pid \in cid} w_{pid} = 1.$ For example, the weights can be chosen to decay monotonically with P(pid|q) within each cid, so as to model the user's attention span and interest, which decrease with the number of products.

5. Steps 3 and 4 are now iterated until the probability P(cid|q) converges to within a prescribed tolerance between any two consecutive iterations.

6. The final probability P(cid|q) is returned.
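
A minimal sketch of the iteration in steps 1 through 6, assuming the SAC's initial estimates, the SLU's per-category match probabilities, and the weights are supplied as plain dictionaries; all names here are placeholders.

```python
# Illustrative sketch only of the iterative estimate above. Placeholder inputs:
#   p_cid_q:     {cid: P(cid|q)}            initial estimates from the SAC
#   p_pid_q_cid: {(pid, cid): P(pid|q,cid)} per-category match probabilities
#   weights:     {(pid, cid): w_pid}        summing to 1 within each cid

def rank_relevance(p_cid_q, p_pid_q_cid, weights, tol=1e-6, max_iter=100):
    p_pid_q = {}
    for _ in range(max_iter):
        # Step 3 / equation (1): P(pid|q) = sum_cid P(pid|q,cid) * P(cid|q)
        p_pid_q = {}
        for (pid, cid), p in p_pid_q_cid.items():
            p_pid_q[pid] = p_pid_q.get(pid, 0.0) + p * p_cid_q.get(cid, 0.0)
        # Step 4 / equation (2): P(cid|q) = sum_{pid in cid} w_pid * P(pid|q)
        new_p_cid_q = {}
        for (pid, cid), w in weights.items():
            new_p_cid_q[cid] = new_p_cid_q.get(cid, 0.0) + w * p_pid_q.get(pid, 0.0)
        # Step 5: stop when P(cid|q) changes by less than the tolerance
        if all(abs(new_p_cid_q.get(c, 0.0) - p_cid_q.get(c, 0.0)) < tol
               for c in set(new_p_cid_q) | set(p_cid_q)):
            return new_p_cid_q, p_pid_q
        p_cid_q = new_p_cid_q
    return p_cid_q, p_pid_q
```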

Parallel Implementation

Referring also to FIG. 5, a search engine 430 can include one or more computational nodes 34 a through 34 n governed by a coordinating component called a federator 33. Federator 33 comprises computer hardware, software, and methods that manage and coordinate multiple nodes 34 a through 34 n. Federator 33 broadcasts each query 17 to a set of one or more nodes 34 a-n, each comprising hardware and software components. Each node 34 a-n stores a segment 54 a-n of index 440, evaluates each query 17 using its index segment 54 a-n, and returns its subset result to federator 33. Federator 33 aggregates subset results 56 a-n from all nodes 34 to build a composite result 58. For embodiments that employ random keys 26, each node 34 a-n translates each random key 26 in its subset result 56 a-n to the corresponding global key 24 before passing its subset result 56 a-n to federator 33. For other such embodiments, federator 33 maintains node-specific mapping tables, so that federator 33 can translate between key sets to build a composite result 58.

Index segments 54 a-n are logically independent, so each node 34 a-n can operate independently. Running the index-matching computations in parallel on multiple nodes 34 a-n is faster than running the same computations in sequence on a single node or comparable computer. The speed improvement is approximately proportional to the number of nodes 34 a-n. Therefore, search speed (also called “join time”) can be approximately halved by doubling the number of nodes 34 a-n. The total number of nodes 34 a-n is variable and can be expanded almost without limit.
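
A minimal sketch of this broadcast-and-aggregate pattern follows, assuming a hypothetical node object with a search method that returns scored hits; none of these names come from the specification.

```python
# Illustrative sketch only: broadcast a query to independent nodes in parallel
# and merge their subset results into a single ranked composite result.
# 'node.search' and the 'score' field are hypothetical stand-ins.

from concurrent.futures import ThreadPoolExecutor

def federate(query, nodes, results_wanted, per_node_limit):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        subset_results = list(pool.map(
            lambda node: node.search(query, limit=per_node_limit), nodes))
    merged = [hit for subset in subset_results for hit in subset]
    merged.sort(key=lambda hit: hit.score, reverse=True)   # serial merge step
    return merged[:results_wanted]
```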

Splitting an Index into Multiple Segments

The process for preparing multiple index segments 54 from data source 420 for multiple nodes 34 is generally similar to that for preparing a single index 440 except that, after preparing a list of terms 44 to be indexed and assigning an index key 46 to each term 44, data source 420 is apportioned into one or more data-source subsets 21 for separate indexing by indexer 38. Each subset 21 comprises one or more documents 22 selected without replacement from data source 420, effectively dividing data source 420 into approximately equal (or, in some embodiments, proportionately divided) portions, one portion per subset 21. For example, assume a data source 420 comprising five-hundred documents 22; and further assume a search engine 430 with ten nodes 34, and hence requiring ten index segments 54. Assuming nodes with approximately equal computational capacity, a general goal in this case is to divide data source 420 into ten subsets 21 a-j, each comprising approximately fifty documents 22, with each document 22 sent to one and only one subset 21 a-j. This process is generally similar to the familiar method for dealing cards from a deck to players in a rotating sequence, so that every player gets about the same number of cards, but dealing a card removes that card from the deck and assigns it to a particular player.

After thus partitioning data source 420, indexer 38 indexes a first subset 21 a as previously described, then independently indexes a second subset 21 b, and so on, until done. The result is a set of index segments 54 a-n, typically one segment 54 per node 34. The set of index keys 46 is common to all subsets 21 and index segments 54, but each subset 21 a-n has its own set of global keys 24 (or, in some embodiments, random keys 26). A particular key value on a first node 34 a may indicate a different document than the same key value indicates on a second node 34 b. In an embodiment, each node 34 a-n maintains its own mapping table to translate its results (drawn from document lists 48 expressed in node-specific random keys 26) to global keys 24 (which are “recognized” by other components of search engine 430). In an alternative embodiment, federator 33 maintains a set of node-specific mapping tables to combine document lists 48 and other results from multiple nodes 34 a-n without ambiguity.

When apportioning data source 420 into a given number of subsets 21 a-n, it is desirable to randomly or pseudo-randomly assign each document 22 to a subset 21 a-n. The preferred embodiment is to create a true-random apportionment, so that any document 22 is equally likely to be assigned to any subset. Many methods for randomly shuffling items into subsets are known to those skilled in the art and are suitable for this purpose. A general goal is to achieve a randomized, even distribution of data to nodes 34, so that each node 34 has an approximately proportional “slice” of data source 420. As mentioned elsewhere, a goal is to substantially balance the workload among nodes, which may differ in computational capacity. Some embodiments may intentionally allocate predetermined, differing proportions of data to different nodes, for example, to compensate for a faster node by giving it more data. “Even distribution” does not imply a requirement for exact mathematical equality or proportionality.

An alternative embodiment employs a pseudo-random apportionment of documents 22 into subsets 21 a-n. For example, a representative method for pseudo-randomly dividing data source 420 into a given number of subsets 21 a-n is to apply a modulus function to the unique global key 24 previously assigned to each document 22. Modulus is a relative of arithmetical division that, given a dividend and a divisor, yields the remainder but not the quotient. By using a given document's global-key value as a dividend and a desired number of index subsets 21 as a divisor, the result is an integer remainder that identifies an index subset 21 for that document 22. For example, assume a set of documents 22 with the global keys [11, 14, 19, 27, 32]. Further assume two nodes 34 a,b, yielding a need to compile two index segments 54 a,b from two index subsets 21 a,b. Here, modulus with a divisor of “2” yields “0” for all even keys [14, 32] and yields “1” for all odd keys [11, 19, 27]. Here, modulus segregates “even” documents from “odd” documents, yielding two subsets 21 a,b, with a first subset 21 a comprising two documents [14, 32] and a second subset 21 b comprising three documents [11, 19, 27].
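
A minimal sketch of the modulus-based apportionment just described; it reproduces the worked example from the text.

```python
# Illustrative sketch only: apportion documents to subsets by taking each
# global key modulo the number of subsets, so every document lands in exactly
# one subset.

def apportion_by_modulus(global_keys, num_subsets):
    subsets = [[] for _ in range(num_subsets)]
    for key in global_keys:
        subsets[key % num_subsets].append(key)
    return subsets

# Worked example from the text: two nodes and keys [11, 14, 19, 27, 32]
# yield an "even" subset [14, 32] and an "odd" subset [11, 19, 27].
print(apportion_by_modulus([11, 14, 19, 27, 32], 2))
```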

A benefit to random or pseudo-random allocation is that it tends to distribute documents 22 with consecutive global keys 24 to different subsets 21 a-n. Related documents 22 tend to enter data source 420 at about the same time, so that sequences of global keys 24 tend to reflect underlying relationships among documents 22. Randomly or pseudo-randomly assigning documents 22 to index subsets 21 a-n disrupts this clustering because it sends documents 22 with adjacent global keys 24 to different subsets 21 a-n and ultimately to different index segments 54 a-n and nodes 34 a-n. This disruption tends to spread related documents 22 among all nodes 34 a-n and tends to minimize the risk that any one node 34 “has” a disproportionate share of documents 22 relevant to a given term 44. Document shuffling equitably distributes the computational workload among the available nodes 34. It also helps to assure that each node 34 returns a statistically representative sample of data drawn from data source 420.

For index segments 54 thus prepared, a given document 22 n “contributes” to one and only one index segment 54 n. All references to document 22 n occur in index segment 54 n as data in its document lists 48, and segments other than segment 54 n do not contain any data derived from document 22 n.

Efficient Federation

Although a search engine 430 according to the present invention offers almost unlimited scalability at the node level, federator 33 performs its processing in series, not in parallel. Its serial operation reflects the nature of its tasks. For example, the task of aggregating subset results 56 a-n to build a composite result 58 does not lend itself to parallel execution because it is a “many to one” operation. The efficiency of federator 33 is a crucial aspect of the efficiency of search engine 430 taken as a whole.

For example, assume that index 440 doubles in size, for example, as a result of an increase in the number of documents 22 in data source 420. A search engine 430 according to the present invention can keep the index-matching time (join time) approximately constant by doubling the number of nodes 34 a-n in order to maintain an approximately constant size for each index segment 54. In this situation, however, federator 33 cannot avoid aggregating twice as many subset results, more than doubling the computational cost for building composite result 58 (because federator 33 needs to sort results, which is an n log n operation).

A randomized or pseudo-randomized shuffling of documents allows specification of the probability of missing a document 22 from a result set. In particular, if the search engine is required to return the best s overall matches for a query against n nodes, assuming that the total product set is much greater than s, then if each node returns n₁ documents the chance of the aggregation of all the node result sets missing at least one of the required top s documents (due to it not being returned by a node) is

${\begin{pmatrix} s - 1 \\ n_{1} \end{pmatrix}} / n^{n_{1}}.$ In other embodiments, the confidence level can be tabulated empirically through stochastic simulation, or by a recursive equation operating on the same or different variables. In an actual application, each computational node may return zero or more documents in its intermediate result set, and the number n₁ represents a maximum number of documents to retrieve from each intermediate result set.

Explicit tradeoffs between aggregation speed and the probability of an incomplete result set readily follow. In some cases, for example, it may be worth returning only ten documents for each of twenty nodes to satisfy a query asking for the top fifty products. Knowing that such a shortcut gives a

${\begin{pmatrix} 49 \\ 10 \end{pmatrix}} / 20^{10},$ or about 1/1000, chance of missing a top-fifty product allows determination of whether this chance of missing the true result is worth the decreased federation time.
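
For illustration, the stated formula can be evaluated directly, and the smallest per-node count meeting a target risk can be found by a simple scan; the helper names below are hypothetical, and in practice the confidence could instead be tabulated by simulation as noted above.

```python
# Illustrative sketch only: probability of missing at least one of the top s
# documents when each of n nodes returns only n1 results, per the formula
# above, plus the smallest n1 that meets a target risk.

from math import comb

def miss_probability(s, n, n1):
    return comb(s - 1, n1) / n ** n1

def smallest_per_node_count(s, n, max_risk):
    n1 = 1
    while miss_probability(s, n, n1) > max_risk:
        n1 += 1
    return n1

# Worked example from the text: top 50 products, 20 nodes, 10 results per node.
print(miss_probability(50, 20, 10))   # ~0.0008, i.e., roughly a 1/1000 chance
```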

These surprising and beneficial results are easiest to understand by starting with a description of what would happen otherwise, without document shuffling.

To see what would happen without a shuffling of documents 22, consider index segments prepared by compiling a single, global index and then dividing it “vertically,” for example, by ranges of index keys. Here, all of the entries for a given term 44 reside on a single node, yielding parallel index storage but not parallel search processing.

Alternatively, consider multiple indexes prepared as described above but without document shuffling. This “horizontal” split yields index segments that list all of (or most of) the indexed terms, but the data in the document lists for each term derives from a set of associated or correlated documents. The goal of a typical search is to return a specified number of best-match entries from index 440. Without shuffling, any given node might happen to store all of, most of, or a disproportionate share of the best-match entries for a given search. Each node therefore must return a subset result at least as large as the specified best-match number. Otherwise, the search might easily fail to identify a significant number of salient documents. Multiple nodes therefore flood the federator with a large volume of results.

Document shuffling allows each node 34 a-n to return fewer results to federator 33 than would otherwise be required, reducing the computational burden on federator 33. For example, suppose a request 418 requires a composite result 58 consisting of one-hundred matches. Without shuffling, the top one-hundred matches from any single node 34 a-n might be the top one-hundred matches overall. Every node 34 a-n therefore must return one-hundred matches to federator 33, in order to be sure that composite result 58 contains the one-hundred best matches. With forty nodes, federator 33 must sort four-thousand items to build a composite result 58 that lists the top one-hundred matches. With shuffling, each node 34 a-n can return fewer matches than the number required in composite result 58. The number of matches specified for each subset result 56 a-n depends on a desired level of statistical certainty, where in general increasing the size of each subset result 56 a-n decreases the risk of missing salient matches. Attaining a given level of confidence might require each node to return only fifty matches, for example. With forty nodes, federator 33 now must sort only two-thousand items to build a composite result containing one-hundred matches.

Based on the assumption that documents are randomly distributed across nodes, the closed-form solution calculates the number of matches that each node must return to guarantee, with an arbitrary-but-specified confidence level, that the final aggregated result set has the top s products. The solution makes use of the fact that, given a random distribution of products across n nodes, the chance of the top two matches being on the same node is 1/n, and the chance of the top three matches being on the same node is 1/n^2. A specific innovation of the present invention is the statistical methodology to allow each node to return fewer than the top s items for a query asking for the top s.

Random Keys

As mentioned above, in an embodiment, each document 22 has a global key 24 to provide a durable identifier for each document 22. The sequence of global keys 24 contains embedded biases reflecting known and occult relationships among documents 22. For example, assume a data source 420 containing product data with global keys 24 assigned sequentially for each new document 22. Incoming batches of new documents 22 tend to be related to each other, so blocks of consecutive global keys 24 are more likely than average to reflect related information. For a shopping site, for example, a photographic supplier might send an email announcing several new cameras. This information would tend to enter the data source 420 as approximately consecutive documents 22 and generate approximately consecutive global keys 24. If index records 42 contain document lists 48 with an internal sequence governed by global keys 24, then operations performed on document lists 48 reflect the statistical biases or clusters embedded in the sequence of documents 22. An operation terminated before traversing the entire document list 48 therefore would not return a statistically representative sample of data source 420.

To avoid statistical biases in global keys 24, a search engine 430 according to the present invention creates a set of keys, called random keys 26, for internal use by federator 33, nodes 34, and other components of search engine 430. One method for randomizing random keys 26 is to start by counting the current number of documents 22. Given a known number of documents 22, indexer 38 can generate a same-sized pool of consecutive keys, select a first random key 26 randomly or pseudo-randomly from the pool, assign it to the first document 22, remove the used key 26 from the pool to assure uniqueness among keys, and continue in like manner to each succeeding document 22 until all documents 22 have a unique random key 26 selected at random from the pool. True-random selection is preferable, but in this context “random” includes pseudo-random selection.

That is, for a data source 420 of n documents 22, each with an arbitrary, unique global key 24, a random mapping to random keys 26 can be efficiently, but not uniquely, constructed by assigning to each global key 24 a number randomly selected, without replacement, from the integer interval [1,n]. Indexer 38 also constructs a mapping table to preserve the resulting tuples, each comprised of a global key 24, random key 26, and index key 46. In the case of a parallel division of index 440 across multiple nodes 34, the mapping is random on each individual index segment 54.
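
A minimal sketch of constructing such a mapping, assuming the global keys are known up front; random.shuffle serves here as the pseudo-random permutation, and the names are placeholders.

```python
# Illustrative sketch only: map each global key to a unique random key drawn
# without replacement from the compact interval [1, n], n = number of documents.

import random

def assign_random_keys(global_keys):
    pool = list(range(1, len(global_keys) + 1))   # compact pool of keys 1..n
    random.shuffle(pool)                          # pseudo-random permutation
    return dict(zip(global_keys, pool))           # mapping table of key pairs

mapping = assign_random_keys([11, 14, 19, 27, 32])
```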

For example, in the case of a keyword index 440 for product documents 22, the term “camera” might occur in one-hundred thousand product documents 22. Document list 48 for “camera” therefore would comprise one-hundred thousand documents 22, with documents 22 occurring in random order with respect to the original data source, because the document list 48 for “camera” is sorted by random keys 26 and not by global keys 24.

This method for assigning random keys 26 yields a compact set of random keys 26, meaning that all consecutive values appear in the set of random keys 26, even though the sequence has been randomized. The term “compact” as used herein includes any uniform (equidistant) distribution of keys. Incremental adjacency, counting by ones, is a convenient compact distribution used as a representative example throughout. Global keys 24, in contrast, may reflect a discontinuous sequence, for example, as a result of ongoing deletion of documents 22 from data source 420. Random keys 26 assigned according to the present invention should be compact in this sense. For example, for an interpolation search with (say) ten documents with minimum random key 1 and maximum key 1,000, it is known that on average each key takes up one-hundred units of space. If the keys were not compact, then perhaps the values from two to nine-hundred would all be empty, in which case the remaining nine keys would be distributed among only one-hundred numbers, leading to a very different required jump sequence for interpolation. Compactness facilitates the estimation of the density of the uniform distribution of random keys within a reverse index element. It would be possible to estimate the density of the uniform distribution if the distribution of the non-compactness was also known and global (for example, if every tenth number were skipped). This alternative embodiment introduces more complexity than simply making the keys compact, and compactness has additional benefits such as faster execution.

When search engine 430 searches index 440, it looks up term 44 according to the value of its index key 46 to get a document list 48. Because the set of random keys 26 is randomized with respect to the set of global keys 24, the sequence of random keys 26 within a document list 48 is random with respect to the underlying documents 22 in the data source 420. As a result, operations applied to one or more document lists 48 but terminated before traversing the entire document list 48 return statistically randomized random keys 26 indicating documents 22 that are a representative sample of data source 420.

Estimation Techniques

Searching in a sequence governed by sorted random keys 26 randomly mapped to global keys 24 means that any segment of the search sequence is a representative sample of data source 420. This represents a simple and elegant solution provided by the present invention relative to the cumbersome approaches of the prior art. As a result of the present invention, search engine 430 can truncate certain searches and extrapolate from the subset results. For example, it is desirable to determine and report summary data such as the total number of matches resulting from each search request 418. Since each request 418 is an unpredictable event, search engine 430 cannot pre-calculate summary data for every possible search. A randomized mapping of random keys to data-source keys eases the computational burden of preparing summary data because search engine 430 can estimate summary data without traversing the entire index. For example, a request 418 estimated to return one-million matches can be halted at one hundred-thousand matches, yielding a large sample but requiring about one-tenth the computation. Any desired summary data can be computed on this subset, and then a simple multiplication of summary counts by ten yields reliable estimates of the results that would have been obtained if the entire index had been queried.
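
A minimal sketch of that truncate-and-scale estimate, assuming a stream of matching documents and a rough estimate of the total match count; the names and the attributes field are placeholders.

```python
# Illustrative sketch only: halt a search after a sample of matches and scale
# the sampled attribute counts up to estimate full-index summary counts.
# 'match_stream' yields hypothetical documents exposing an 'attributes' field.

from collections import Counter

def estimate_attribute_counts(match_stream, estimated_total, sample_size):
    sample = []
    for doc in match_stream:
        sample.append(doc)
        if len(sample) >= sample_size:
            break
    counts = Counter(attr for doc in sample for attr in doc.attributes)
    scale = estimated_total / max(len(sample), 1)  # e.g., 1,000,000 / 100,000 = 10
    return {attr: count * scale for attr, count in counts.items()}
```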

The representative nature of the running results allows the search engine to dynamically determine when to perform such truncation by analyzing data from an ongoing search. For example, consider two different running outcomes when joining a one-million document term term1 to another one-million document term term2. Among the first sequential 1,000 documents in the document list for term1, (a) 1,000 matched documents are found in term2, versus (b) one matched document is found in term2. Using appropriate inferential statistics, the search engine calculates the expected number of documents returned to be much greater for (a) than (b), and so it may stop the search in case (a) much earlier.

Even if all the documents are returned from a search, so that the full join is performed, sampling made possible by random ids is a potential timesaver for operations that are performed on the result set after the join. For example, the search engine is commonly requested to provide summary counts of the attributes present in the final result set. Each document may be associated with a large number of attributes, so in the naïve case attribute summaries would require, for each document in the results, counting each attribute linked to the document. More efficient count estimates could be generated by randomly selecting documents whose attributes to count; however, such selection is itself a fairly expensive procedure. In contrast, the random ordering of documents in the result set, ensured by the reverse index's sorting by random id and the sequential nature of processing of documents within terms, allows the search engine to simply take the first n documents of the results as a random sample from which to generate attribute counts.

Interpolation Search

Another advantage to randomized keys is that search engine 430 can employ an interpolation-search technique (Knuth, Sorting and Searching, Second Edition (Reading, Mass.: Addison-Wesley, 1998)) for finding matches, saving computation time in comparison to alternative techniques such as linear or binary searching. In a linear search, the search evaluates each entry in a list in sequence from the beginning. In a binary search, the starting point is the exact middle of the list, and each successive starting point divides the appropriate half of the remaining list exactly in half.

In an interpolation search, in contrast, a search starts by estimating the first position to check and continues by estimation until done. For a familiar analogy, consider the act of searching a printed phone book for the last name “Xavier.” The letter “X” is known to be near the end of the alphabet. Given this knowledge, it is efficient to start the search near the end of the phone book. And if the first page tested contains names starting with “S,” then it is efficient to try again near the end of the remaining pages (“S” through “Z”), since “X” is closer to “Z” than “X” is to “S.”

The ability to use an interpolation search depends on knowing how the items are distributed in a list, in order to rapidly make a good estimate of a target location. For example, if a list is known to be a list of integers 1 to n, and if the target is given as “t,” then the estimated location would be index “t,” a match in one step as a result of the (perfectly) known distribution of the data.

Use of an interpolation-search technique allows search engine 430 to avoid performing a linear search (scanning the entire list) or a binary search (successively dividing the remaining part of the list in half). For example, an interpolation search for a document whose random key is 100,000 within a reverse index with a maximum random key of 400,000, a minimum of 1, and two-hundred thousand total documents, starts with an initial guess at the quite efficient position of 50,000. It arrives at this estimate because there are two-hundred thousand documents spread uniformly in the range of 1 to 400,000, so that each document has an expected coverage of two; therefore, document key 100,000 is expected to be at position 50,000. Then, depending on whether the initial position evaluates high or low, the search applies the same technique to the appropriate remaining interval of possibility.

Although extremely powerful, interpolation search is little used in practice because of the need to know the distribution of the underlying list. Randomization of random keys is a means of ensuring a known distribution—namely, an asymptotically uniform distribution.

Index 440 for any given term 44 contains a document list 48 of all the documents 22 that have that term 44. Document list 48 is ordered by random key 26 and hence has an expected uniform distance between elements. For example, if the term “the” has a document list 48 comprising one-million random keys 26 ranging from key 1 to 2,000,000, then on average each document accounts for a space of 2. That is, the document ids (random keys) of an index term that points to x documents identified by random keys out of a total set of n indexed documents are a random sample without replacement from [1,n], meaning that the document ids (random keys) are uniformly distributed on [1,n] with a density of x/n. Thus, on average, each record takes up n/x of the total space of n (or, since the maximum record key n2 and minimum record key n1 are easily obtained for the index term, more precisely (n2−n1)/x of the actual space of the term: the documents are uniformly distributed on [n1,n2] with density x/(n2−n1)). Assuming reasonable-size index elements, so that boundary issues due to records near n2 and n1 may be neglected, each record then has on average (n2−n1)/(2x) on each side of it, so the average distance between adjacent records is (n2−n1)/x. Knowing this average distance, the search engine can at its discretion employ the highly efficient interpolation search algorithm.
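
A minimal sketch of an interpolation search over a document list sorted by compact random keys; the uniform spacing discussed above is what makes the position estimate reliable. The function and variable names are placeholders.

```python
# Illustrative sketch only: interpolation search over a sorted list of compact
# random keys. Returns the index of 'target' or -1 if it is absent.

def interpolation_search(sorted_keys, target):
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi and sorted_keys[lo] <= target <= sorted_keys[hi]:
        if sorted_keys[hi] == sorted_keys[lo]:
            pos = lo
        else:
            # Estimate the position from the key's value relative to the range,
            # assuming the keys are (approximately) uniformly spaced.
            pos = lo + (target - sorted_keys[lo]) * (hi - lo) // (
                sorted_keys[hi] - sorted_keys[lo])
        if sorted_keys[pos] == target:
            return pos
        if sorted_keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```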

Persons skilled in the art will recognize that many modifications and variations are possible in the details, materials, and arrangements of the parts and actions which have been described and illustrated in order to explain the nature of this invention and that such modifications and variations do not depart from the spirit and scope of the teachings and claims contained therein.

1. In a computerized search system in which queries are submitted by users who receive, in response, a list of documents selected from a corpus of documents wherein the list comprises documents deemed responsive to a user's query, a method of processing the query comprising: obtaining the query from a user; distributing the query to a plurality of independent computational nodes, each computational node configured to search a corresponding segment of an index of the corpus of documents; running the query in each of the computational nodes against the corresponding segment of the index of the corpus of documents to obtain from each of the computational nodes an intermediate results list of zero or more documents in the document corpus deemed responsive to the query; determining a number of results n₁ to retrieve from each of the intermediate results lists obtained by the computational nodes, the determined number of results n₁ providing a quantitative probability of not retrieving one most responsive result in an aggregated number of results, where the determined number of results is less than the aggregated number of results and the quantitative probability is given by a statistical formula: ${\begin{pmatrix} s - 1 \\ n_{1} \end{pmatrix}} / n^{n_{1}},$ wherein n is a number of independent computational nodes and s is the number of aggregated results; and aggregating the determined number of results from each of the intermediate results lists containing the determined number of results or more into an output result list, otherwise aggregating all results from each of the intermediate results lists containing fewer than the determined number of results into the output result list.
2. The method of claim 1, wherein the corpus of documents is evenly distributed among the independent computational nodes.
3. The method of claim 1, wherein each document within the corpus of documents is randomly assigned to one segment of the index.
4. The method of claim 1, wherein each document within the corpus of documents is pseudo-randomly assigned to one segment of the index.
5. The method of claim 1, wherein determining the number of results comprises determining the number of results based in part on a number of independent computational nodes.
6. The method of claim 5, wherein the determined number of results is further based on the aggregated number of results in the output result list.
7. The method of claim 1, wherein determining the number of results comprises determining the number of results independent of a number of results in each intermediate results list.
8. The method of claim 1, wherein determining the number of results comprises determining a plurality of numbers of results with one of the plurality of numbers of results applicable to each of the intermediate results lists.
9. The method of claim 1, wherein the same determined number of results applies for all nodes.
10. The method of claim 1, wherein aggregating the number of results into the output result list comprises: retrieving up to the determined number of results from each of the intermediate results lists to generate the output result list; and ranking the documents in the output result list.
11. In a computerized search system in which queries are submitted by users who receive, in response, a list of documents selected from a corpus of documents wherein the list comprises documents deemed responsive to a user's query, a method of processing the query comprising: obtaining the query from a user; distributing the query to a plurality of independent computational nodes, each computational node configured to search a corresponding segment of an index of the corpus of documents; running the query in each of the computational nodes against the corresponding segment of the index of the corpus of documents to obtain from each of the computational nodes an intermediate results list of zero or more documents in the document corpus deemed responsive to the query; determining a number of results n₁ to retrieve from each intermediate results list based on a total number of results desired, a number of independent nodes, and a quantitative probability of not retrieving one most responsive result in the total number of results desired, where the determined number of results n₁ is less than the total number of results desired and the quantitative probability is given by a statistical formula: ${\begin{pmatrix} s - 1 \\ n_{1} \end{pmatrix}} / n^{n_{1}},$ wherein n is a number of independent computational nodes and s is the total number of results desired; retrieving up to the determined number of results from each of the intermediate results lists; and generating an aggregate output result list having no more than the total number of results desired.
12. In a computerized search system in which queries are submitted by users who receive, in response, a list of documents selected from a corpus of documents wherein the list comprises documents deemed responsive to a user's query, an apparatus for processing the query comprising: a plurality of computational nodes, each computational node configured to search a corresponding segment of an index of the corpus of documents to obtain an intermediate results list of zero or more documents in the document corpus deemed responsive to the query; and a federator configured to receive the query and communicate the query to each of the plurality of computational nodes and configured to aggregate a portion of at least one intermediate result list to generate an output result list, the output result list including an aggregated number of results, the federator selecting the portion of the at least one intermediate result list based in part on a quantitative probability of not retrieving one most responsive result in the aggregated number of results, wherein the federator is configured to determine a number of results n₁ to retrieve from each of the intermediate results lists based at least in part on a number of computational nodes n with the quantitative probability being given by a statistical formula: ${\begin{pmatrix} s - 1 \\ n_{1} \end{pmatrix}} / n^{n_{1}},$ wherein s is the number of aggregated results.
13. The apparatus of claim 12, wherein the corpus of documents is evenly distributed among the corresponding segments of the index.
14. The apparatus of claim 12, wherein each of the corresponding segments of the index of the corpus of documents references documents randomly allocated to the segment of the index.
15. The apparatus of claim 12, wherein each of the corresponding segments of the index of the corpus of documents references documents pseudo-randomly allocated to the segment of the index.
16. The apparatus of claim 12, wherein the federator is configured to determine the number of results to retrieve from each of the intermediate results lists based at least in part on the aggregated number of results in the output result list.
17. The apparatus of claim 12, further comprising a formatter coupled to the federator and configured to generate a response to the query based on the output result list.
18. The apparatus of claim 17, wherein the formatter is configured to retrieve at least a portion of documents identified in the output results list.